Cancer Epidemiology, Biomarkers & Prevention — Latest Matching Preprints

1

Impact of surveillance colonoscopy on colorectal cancer incidence and mortality in Lynch syndrome - a national observational cohort study of patients in the English NHS 2010-2022

Huntley, C.; Loong, L.; Mallinson, C.; Rahman, T.; Torr, B.; Allen, S.; Allen, I.; Hassan, H.; Fru, Y. W. J.; Tataru, D.; Paley, L.; Vernon, S.; Houlston, R.; Muller, D.; Lalloo, F.; Shaw, A.; Burn, J.; Morris, E.; Tischkowitz, M.; Antoniou, A. C.; Pharoah, P. D. P.; Monahan, K.; Hardy, S.; Turnbull, C.

2026-04-22 oncology 10.64898/2026.04.16.26351020 medRxiv

Top 0.1%

12.2%

Background: Lynch syndrome (LS) is a cancer susceptibility syndrome caused by germline pathogenic variants in DNA mismatch repair (MMR) genes. Due to increased risk of colorectal cancer (CRC), enhanced colonoscopic surveillance is recommended for heterozygote MMR-carriers. Objective: Using a registry of English LS patients linked to digital National Health Service records, we aimed to assess adherence of MMR-carriers to national surveillance guidelines, and to determine the impact of surveillance on CRC incidence and mortality. Design: We described the frequency of colonoscopies in 4,732 MMR-carriers and used logistic regression to determine predictors of surveillance adherence. For MMR-carriers with a record of surveillance and those without, we: estimated age-specific annual CRC incidence rates (AS-AIRs) and cumulative lifetime risks, assessed for stage-shift by comparing CRC stage distributions and stage-specific AS-AIRs, and estimated risks of death from CRC and any cause using Kaplan-Meier methods and Cox Proportional Hazards regression. Results: Surveillance at a mean interval of ≤3 years (n=3028) was associated with a decrease in CRC-specific and all-cause mortality, without an associated change in total CRC incidence, even after multivariate adjustment. No strong evidence of stage-shift was observed. Colonoscopic surveillance at a mean interval of ≤2 years (n=1569) was associated with an increase in total CRC incidence. Incidence of early-stage cancers was also higher, with no corresponding decrease in late-stage cancers, which may reflect the short follow-up period or the impact of overdiagnosis. Conclusion: The observed reduction in all-cause mortality amongst regularly-surveilled MMR-carriers may indicate an impact of surveillance on CRC-specific mortality, though in the context of a non-randomised study likely reflects the influence of selection bias.
Key messages of article. What is already known on this topic: Regular surveillance colonoscopy is recommended in Lynch syndrome, though evidence to support this remains mixed. We searched PubMed for articles published from inception to 01/05/2024 using the terms "Lynch syndrome", "HNPCC", "colonoscopy", "sigmoidoscopy", "surveillance", and "screening". We found one controlled trial and several small analytical studies dating from the early 2000s which compared surveilled and non-surveilled populations and found surveillance to be associated with reduced colorectal cancer (CRC) incidence and improved survival. More recent longitudinal observational studies, most without comparator groups, found a high incidence of CRC in LS populations despite being resident in countries where surveillance was recommended. A small number of studies directly assessed time since last colonoscopy against CRC incidence and stage with mixed findings. Finally, cross-sectional comparisons between countries of CRC incidence rates and surveillance interval recommendations found no relationship between the two [1,2]. What this study adds: Here, we conduct an observational cohort study on a large national cohort of MMR germline pathogenic variant (GPV) carriers (MMR-carriers) in England (n=4,732), comparing CRC incidence and mortality in individuals with a record of regular surveillance to those without. Through linkage of the English National Lynch Syndrome Registry to Hospital Episodes Statistics data, we are uniquely able to study a comprehensive national population of MMR-carriers and identify the dates on which colonoscopies were undertaken over time, allowing assessment of adherence to national surveillance guidelines and the impact this has on CRC outcomes. Notably, receipt of regular colonoscopy was strongly associated with deprivation as well as ethnicity.
The results show that regular surveillance at an average interval of 3 years (or less) is not associated with a reduction in CRC incidence when compared to less frequent surveillance, but an apparent decrease in both CRC-specific and overall mortality is observed, even after adjustment for confounding variables. Conversely, regular surveillance at an average interval of 2 years (or less) is associated with an increase in CRC incidence when compared to less frequent surveillance, which may suggest increased diagnosis of early-stage cancers or, due to the absence of a reduction in late-stage cancers, overdiagnosis. The observed impact of surveillance on overall mortality may demonstrate the impact of surveillance on CRC-specific mortality, or, in the context of an observational (non-randomised) study, indicate that the results are subject to selection bias. How this study might affect research, practice, or policy: Evidence for the benefit of surveillance colonoscopy remains mixed. Whilst polypectomy would be anticipated to prevent CRC development (thus reducing CRC incidence), several studies have observed increased frequency of CRCs in MMR-carriers undergoing frequent surveillance colonoscopy, which may reflect overdiagnosis. The selection bias inherent to observational studies of surveillance renders mortality outcomes challenging to interpret. Randomised controlled trials of colonoscopic surveillance in MMR-carriers are required for effectiveness of this intervention to be accurately assessed. Given ethical and feasibility challenges, randomised controlled trials might be complemented by quasi-experimental designs using advanced observational methods for assessing effectiveness.

2

Comparative fine-mapping of breast cancer susceptibility loci using summary statistics methods and multinomial regression

O'Mahony, D. G.; Beasley, J.; Zanti, M.; Dennis, J.; Dutta, D.; Kraft, P.; Kristensen, V.; Chenevix-Trench, G.; Easton, D. F.; Michailidou, K.

2026-04-22 epidemiology 10.64898/2026.04.21.26351364 medRxiv

Top 0.1%

10.0%

Summary statistics fine-mapping methods offer advantages over classical methods, including avoiding data-sharing constraints and improved modelling of correlated variables and sparse effects. However, their performance has not been comprehensively evaluated in breast cancer using real-world data. Previous multinomial stepwise regression (MNR) fine-mapping analyses for breast cancer identified 196 credible sets. Here, we apply summary statistics fine-mapping, compare methods, and assess parameters influencing performance. Using summary statistics from the Breast Cancer Association Consortium, we compared finiMOM, SuSiE, and FINEMAP to published MNR results across 129 regions. Performance was assessed by recall using in-sample and out-of-sample LD. Discordant credible sets were examined for technical factors, and target genes were defined using the INQUISIT pipeline. SuSiE showed the closest agreement with MNR. Results varied across regions depending on the assumed number of causal variants (L), with higher values reducing recall and no single L maximising performance. At optimal L per region, SuSiE identified 8,192 CCVs in 244 credible sets, with recall of 88%, 86%, and 72% for overall, ER-positive, and ER-negative breast cancer. Thirty MNR sets were missed. Discordance was partially explained by allele flips, imputation quality, and array heterogeneity. Fifty-two MNR-identified genes, including BRCA2, WNT7B and CREBBP, were not recovered, while additional candidate genes were identified. Using out-of-sample LD reduced recall by 3% but identified novel variants. Fine-mapping results vary across methods, and no single approach is sufficient. The choice of L strongly influences results, and combining analytical approaches with functional validation can improve causal variant identification.

3

Novel Genetic Risk Loci for Pancreatic Ductal Adenocarcinoma Identified in a Genome-wide Study of African Ancestry Individuals

Vergara, C.; Ni, Z.; Zhong, J.; McKean, D.; Connelly, K. E.; Antwi, S. O.; Arslan, A. A.; Bracci, P. M.; Du, M.; Gallinger, S.; Genkinger, J.; Haiman, C. A.; Hassan, M.; Hung, R. J.; Huff, C.; Kooperberg, C.; Kastrinos, F.; LeMarchand, L.; Lee, W.; Lynch, S. M.; Moore, S. C.; Oberg, A. L.; Park, M. A.; Permuth, J. B.; Risch, H. A.; Scheet, P.; Schwartz, A.; Shu, X.-O.; Stolzenberg-Solomon, R. Z.; Wolpin, B. M.; Zheng, W.; Albanes, D.; Andreotti, G.; Bamlet, W. R.; Beane-Freeman, L.; Berndt, S. I.; Brennan, P.; Buring, J. E.; Cabrera-Castro, N.; Campa, D.; Canzian, F.; Chanock, S. J.; Chen, Y.;

2026-04-22 genetic and genomic medicine 10.64898/2026.04.21.26351329 medRxiv

Top 0.1%

6.8%

Pancreatic cancer disproportionately affects Black individuals in the United States, but they have limited representation in genetic studies of pancreatic ductal adenocarcinoma (PDAC). To address this gap, we performed admixture mapping and genome-wide association analysis (GWAS) in genetically inferred African ancestry individuals (1,030 cases and 889 controls). Admixture mapping identified three regions with a significantly higher proportion of African ancestry in cases compared to controls (5q33.3, 10p1, 22q12.3). GWAS identified a genome-wide significant association at 5p15.33 (CLPTM1L, rs383009:T>C, T Allele Frequency=0.51, OR:1.45, P value=1.24x10^-8), a locus previously associated with PDAC. Known loci at 5p15.33, 7q32.3, 8q24.21 and 7q25.1 also replicated (P value <0.01). Multi-ancestral fine-mapping identified two potential causal SNPs (rs3830069 and rs2735940) at 5p15.33. Collectively, these findings identified novel PDAC risk loci and expanded our understanding of this deadly cancer in underrepresented populations, emphasizing the multifactorial nature of PDAC risk including inherited genetic and non-genetic factors. Statement of Significance: To understand how genetic variation contributes to PDAC risk in Black people in North America, we studied individuals of genetically-inferred African ancestry. We identified novel risk loci and differences in the contribution of known loci. This demonstrates that ancestry-informed genetic analyses improve our understanding of PDAC risk and enhance discovery.

4

Comparing Gleason Pattern 4 Measurement Approaches on Prostate Biopsy Using Machine Learning: A Proof-of-Principle Study

Buzoianu, M. M.; Yu, R.; Assel, M.; Bozkurt, A.; Aghdam, H.; Fine, S.; Vickers, A.

2026-04-24 oncology 10.64898/2026.04.23.26351615 medRxiv

Top 0.1%

4.4%

Objective: To demonstrate the proof of principle that machine learning (ML) can be used to quantify Gleason Pattern (GP) 4 on digitized biopsy slides using multiple measurement approaches, allowing direct comparison of their prognostic performance. Methods: We assembled a convenience sample of 726 patients with grade group 2-4 prostate cancer on systematic biopsy who underwent radical prostatectomy between 2014 and 2023. Digitized biopsy slides were analyzed using a machine-learning algorithm (PAIGE-AI) to quantify GP4 using multiple measurement approaches, particularly with respect to how gaps between cancer foci (interfocal stroma) were handled. GP4 extent was quantified using linear measurements or a pixel-based area metric. Discrimination of each GP4 quantification approach, along with Grade Group (GG), was assessed for adverse radical prostatectomy pathology and biochemical recurrence (BCR). Results: We identified 15 different quantification approaches and observed differences between their discrimination. The highest discrimination was in the pixel-counting method (AUC 0.648). GP4 quantification outperformed GG for predicting adverse pathology (AUC 0.627 vs 0.608). Amount of GP3 was non-predictive once GP4 was known. These findings were consistent for BCR. Conclusions: We were able to measure slides using 15 distinct measurement approaches and replicated prior findings using ML to quantify GP4. Our findings support the use of ML as a research tool to compare different GP4 quantification approaches. We intend to use our method on larger cohorts to determine which measurement approach best predicts oncologic outcome.

5

Weight Trajectories and Cancer Risk: A Pooled Cohort Study

Nilsson, A.; da Silva, M.; Le, H. T.; Haggstrom, C.; Wahlstrom, J.; Michaelsson, K.; Trolle Lagerros, Y.; Sandin, S.; Magnusson, P. K.; Fritz, J.; Stocks, T.

2026-04-24 epidemiology 10.64898/2026.04.23.26351553 medRxiv

Top 0.1%

4.2%

Excess body weight has been associated with increased cancer risk, but the role of weight change across adulthood remains unclear. We examined body weight trajectories from ages 17 to 60 and their associations with site-specific cancer incidence. Data were based on the ODDS study, a pooled, nationwide cohort study in Sweden, with data on weight spanning 1911 to 2020, and cancer follow-up through 2023. Weight trajectories were estimated with linear mixed effects models in individuals with at least three weight measurements. Cox regressions estimated hazard ratios for associations between weight trajectories and established and potentially obesity-related cancers. Fifth versus first quintile of weight change was associated with many cancers, most strongly with esophageal adenocarcinoma in men (HR 2.25; 95% CI 1.66-3.04), liver cancer in men (HR 2.67; 95% CI 2.15-3.33), endometrial cancer in women (HR 3.78; 95% CI 3.09-4.61), and pituitary tumors in both sexes (men: HR 3.13 [95% CI 2.13-4.61]; women: HR 2.13 [95% CI 1.41-3.22]). Associations varied by sex and age. Heavier weight at age 17 years and earlier obesity onset were also associated with higher cancer incidence. These findings highlight the importance of a life-course approach to weight management and support sex- and age-targeted cancer prevention strategies.

6

A catalogue of missense and nonsense mutation abundances for the U.S. cancer patient population

Arun, A.; Liarakos, D.; Mendiratta, G.; McFall, T.; Hargreaves, D. C.; Wahl, G. M.; Hu, J.; Stites, E. C.

2026-04-22 oncology 10.64898/2026.04.20.26351248 medRxiv

Top 0.1%

4.1%

Widespread genomic sequencing efforts have characterized the molecular foundations of the different cancers. By combining these genomic data in a manner proportional to the population-level abundances of these different cancers, we estimate the overall abundances of each observed missense and nonsense mutation within the U.S. cancer patient population. We find BRAF V600E (5.2%) is the most common mutation in the cancer patient population, TP53 R175H (1.5%) is the most common tumor suppressor mutation, and APC R876X (0.4%) is the most common nonsense mutation. These values differ largely and significantly from what would be found in a typical pan-cancer analysis, where different cancer types are included out of proportion to population level incidence. We present the full ordered lists of population-level abundances for specific missense and nonsense mutations, and we demonstrate the value of these data by further analyzing high priority genes (e.g., TP53, KRAS, BRAF) and pathways (e.g., RTK/RAS, PI3K, and WNT/β-catenin). Overall, this information is a resource that should benefit the basic science, translational, and clinical cancer research communities.
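
The weighting idea in this abstract can be illustrated with a minimal sketch. It assumes the population-level abundance is an incidence-weighted average of per-cancer mutation frequencies, which is one plausible reading of "in a manner proportional to the population-level abundances" and not necessarily the authors' exact procedure. The cancer types, incidence shares, sample counts, and frequencies are hypothetical placeholders, not study data.

```python
# Hypothetical illustration: every number below is a made-up placeholder.

def population_abundance(incidence_share, mutation_freq):
    """Weight each cancer type's mutation frequency by its share of all cases."""
    total_share = sum(incidence_share.values())
    return sum(
        (share / total_share) * mutation_freq.get(cancer, 0.0)
        for cancer, share in incidence_share.items()
    )


def pooled_abundance(samples_per_cancer, mutation_freq):
    """Pan-cancer pooling, where weights follow how many samples were sequenced."""
    total = sum(samples_per_cancer.values())
    return sum(
        n * mutation_freq.get(cancer, 0.0)
        for cancer, n in samples_per_cancer.items()
    ) / total


# Share of all diagnosed cases contributed by each (hypothetical) cancer type.
incidence_share = {"type_a": 0.60, "type_b": 0.30, "type_c": 0.10}
# Fraction of tumours of each type carrying the (hypothetical) mutation.
mutation_freq = {"type_a": 0.02, "type_b": 0.10, "type_c": 0.50}
# A sequencing cohort that over-samples the rare, mutation-rich type.
samples = {"type_a": 100, "type_b": 100, "type_c": 300}

weighted = population_abundance(incidence_share, mutation_freq)
pooled = pooled_abundance(samples, mutation_freq)
print(f"incidence-weighted: {weighted:.3f}, pooled by sample count: {pooled:.3f}")
```

In this toy setup the pooled estimate is several times the incidence-weighted one because the rare type dominates the sequenced cohort, which is the kind of distortion the abstract attributes to typical pan-cancer analyses.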

7

Gut Microbiome as a Diagnostic Biomarker for Early Cancer Detection: A Systematic Review and Meta-Analysis of 18 Studies across Five Cancer Types

TALL, M. l.

2026-04-22 cancer biology 10.64898/2026.04.19.719461 medRxiv

Top 0.2%

3.6%

Background: The gut microbiome has emerged as a promising non-invasive biomarker for early cancer detection. However, evidence remains fragmented across individual studies with limited cross-cancer comparisons. Objectives: To systematically evaluate the diagnostic accuracy of gut microbiome-based signatures across five major cancer types: colorectal cancer (CRC), gastric cancer (GC), pancreatic ductal adenocarcinoma (PDAC), hepatocellular carcinoma (HCC), and lung cancer (LC). Methods: We conducted a systematic literature search in PubMed, Embase, and Web of Science (January 2000 - April 2026), following PRISMA 2020 guidelines. Studies reporting area under the receiver operating characteristic curve (AUC) for microbiome-based cancer classification were included. Pooled AUC estimates were derived using a DerSimonian-Laird random-effects model. Study quality was assessed using the Newcastle-Ottawa Scale (NOS). Results: Eighteen studies (2,587 participants) met inclusion criteria. Pooled AUC values were: CRC 0.785 (95% CI 0.750-0.819; I²=30.6%), GC 0.834 (0.781-0.887; I²=56.6%), PDAC 0.853 (0.785-0.921; I²=60.8%), HCC 0.809 (0.747-0.871; I²=70.3%), and LC 0.780 (0.738-0.822; I²=25.0%). Fusobacterium nucleatum was consistently enriched across CRC, GC, and PDAC, while Faecalibacterium prausnitzii and Akkermansia muciniphila were depleted in all five cancer types. Porphyromonas gingivalis showed the highest fold-change in PDAC (log2FC=+2.8). Risk of bias was moderate-to-high in all studies. Conclusions: Gut microbiome profiling demonstrates good-to-excellent diagnostic accuracy (AUC 0.78-0.85) across five major cancer types. Shared cross-cancer biomarkers suggest common dysbiotic mechanisms amenable to pan-cancer screening. These findings support integration of microbiome signatures into multi-modal cancer detection platforms.
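
For readers unfamiliar with the pooling method named in the Methods, the following is a minimal sketch of the standard DerSimonian-Laird random-effects estimator. It assumes each study contributes a point estimate and a standard error on the same scale; whether the review pooled AUCs on the raw or a transformed scale is not stated, so the raw scale here is an assumption, and the two-study input is hypothetical.

```python
import math


def dersimonian_laird(estimates, std_errors):
    """Return (pooled estimate, its standard error, tau^2, I^2 in percent)."""
    k = len(estimates)
    # Fixed-effect (inverse-variance) weights and pooled mean.
    w = [1.0 / se ** 2 for se in std_errors]
    sum_w = sum(w)
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum_w
    # Cochran's Q measures dispersion of study estimates around the fixed mean.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))
    # Method-of-moments estimate of between-study variance, truncated at zero.
    c = sum_w - sum(wi ** 2 for wi in w) / sum_w
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    # Random-effects weights add tau^2 to each within-study variance.
    w_star = [1.0 / (se ** 2 + tau2) for se in std_errors]
    pooled = sum(wi * yi for wi, yi in zip(w_star, estimates)) / sum(w_star)
    se_pooled = math.sqrt(1.0 / sum(w_star))
    # I^2: share of total variability attributed to heterogeneity.
    i2 = max(0.0, (q - (k - 1)) / q) * 100.0 if q > 0 else 0.0
    return pooled, se_pooled, tau2, i2


# Hypothetical inputs: two studies with AUCs 0.70 and 0.90, each with SE 0.10.
pooled, se_pooled, tau2, i2 = dersimonian_laird([0.70, 0.90], [0.10, 0.10])
low, high = pooled - 1.96 * se_pooled, pooled + 1.96 * se_pooled
print(f"pooled={pooled:.3f} (95% CI {low:.3f}-{high:.3f}), tau2={tau2:.4f}, I2={i2:.1f}%")
```

With only a handful of studies per cancer type, the between-study variance from this estimator is itself imprecise, which is worth keeping in mind when reading the I² values above.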

8

Estimation of cancer cases in transgender and gender diverse people in England

Pasin, C.; Jackson, S. S.; Thynne, L.-E.; McWade, B.; Westerman, T.; Ball, R.; Kavanagh, J.; O'Callaghan, S.; Ring, K.; Orkin, C.; Berner, A. M.

2026-04-22 oncology 10.64898/2026.04.21.26351378 medRxiv

Top 0.2%

2.6%

Objectives: To estimate current, and 5- and 10-year projected, number of cases of cancer per year in transgender and gender diverse (TGD) people in England, overall and by tumour type, accounting for uptake of gender affirming care (GAC). Design: Population-based epidemiological modelling study using an age-stratified Monte Carlo simulations approach and the NORDPRED method for predictions. Setting: Models estimating cancer case numbers for TGD people in England based on publicly available 2023 cancer surveillance data and survey-based 2025 GAC access, and predicted at 5 and 10 years hence. Participants: TGD people aged 15 years and above. Main outcome measures: Primary cancer cases per year overall, by gender, age group, tumour type, and current and planned GAC. Results: The estimated TGD population size in England is 441547 (95% uncertainty interval (UI) 429207-452890). Total cases per year of cancer in TGD people is expected to be 966 (95% UI 882-1069) excluding non-melanoma skin. Most cases are expected to occur in people aged 60-64. The top 5 expected cancers in TGD people are breast (19%, n = 187, 95% UI 149-241), colorectal (12%, n = 117, 95% UI 106-129), lung (11%, n = 108, 95% UI 96-122), melanoma (7.1%, n = 69, 95% UI 64-74) and urinary (6.2%, n = 60, 95% UI 54-67). Total cases of cancer in TGD people are estimated to be 1740 (95% UI 1584-1934) in 5 years and 2258 (95% UI 2066-2507) in 10 years (excluding non-melanoma skin). If TGD people were able to access their planned level of GAC, this would reduce these figures to 1555 (95% CI 1386-1766) and 2012 (95% CI 1797-2282) respectively. Conclusions: This study provides prediction of cancer cases in TGD people in England, supporting the planning of service provision and training. This is vital, as with increasing disclosure, and long wait times for GAC, cancer cases in TGD people are predicted to increase.
Summary boxes. What is already known on this topic: The annual number of cases of cancer in transgender and gender diverse (TGD) people in England is currently unknown as gender incongruence is not collected as part of the National Cancer Registration and Analysis Service. Some gender-affirming care (GAC) interventions are known to modulate cancer risk. Use of testosterone and chest reconstruction for transmasculine people is known to reduce their incidence of breast cancer compared to cisgender women. Use of oestradiol alongside medical or surgical androgen suppression has been shown to reduce the incidence of prostate cancer in transfeminine people while increasing their risk of breast cancer, compared to cisgender men. What this study adds: This study found that there are likely to be approximately 966 cases of cancer (excluding non-melanoma skin) in TGD people per year in the UK. Though total annual cases of cancer in TGD people are expected to be 2258 in 10 years, improved access to gender-affirming care could reduce total cases to 2012 (an 11% reduction). These figures provide additional justification for funding to improve access to GAC via the National Health Service (NHS), as well as for training on the oncological needs of this population.
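
The quoted 11% reduction can be reproduced from the projected case counts given in the abstract. The short check below uses only those reported point estimates; the helper name `relative_reduction` is illustrative, and uncertainty intervals are ignored.

```python
def relative_reduction(baseline, alternative):
    """Fraction by which `alternative` falls below `baseline`."""
    return (baseline - alternative) / baseline


# Projected annual cases: current GAC access vs planned GAC access.
ten_year = relative_reduction(2258, 2012)
five_year = relative_reduction(1740, 1555)
print(f"10-year: {ten_year:.1%}, 5-year: {five_year:.1%}")
```

Both horizons round to the same whole-number percentage, so the headline figure does not depend on which projection is used.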

9

Racioethnic Disparities in Risk of Cardiometabolic Risk Factors and Cardiovascular Disease among Women Treated for Breast Cancer: The Pathways Heart Study

Yao, S.; Zimbalist, A.; Sheng, H.; Fiorica, P.; Cheng, R.; Medicino, L.; Omilian, A.; Zhu, Q.; Roh, J.; Laurent, C.; Lee, V.; Ergas, I.; Iribarren, C.; Rana, J.; Nguyen-Huynh, M.; Rillamas-Sun, E.; Hershman, D.; Ambrosone, C.; Kushi, L.; Greenlee, H.; Kwan, M.

2026-04-24 epidemiology 10.64898/2026.04.23.26351612 medRxiv

Top 0.2%

2.1%

Background: Few studies have examined racioethnic disparities in cardiovascular disease (CVD) in women after breast cancer treatment, who are at higher risk due to cardiotoxic cancer treatment. Methods: Based on the Pathways Heart Study of women with a history of breast cancer, this analysis examines the association between cardiometabolic risk factors (hypertension, diabetes, and dyslipidemia) and CVD events with self-reported race and ethnicity, as well as genetic similarity. Multivariable logistic and Cox proportional hazards regression models were used to test race and ethnicity and genetic similarity with prevalent and incident cardiometabolic risk factors and CVD events. Results: Of the 4,071 patients in this analysis, non-Hispanic Black (NHB), Asian, and Hispanic women were more likely to have prevalent and incident diabetes than non-Hispanic White (NHW) women. Analysis of genetic similarity revealed results consistent with self-reported race and ethnicity. For CVD risk, NHB women were more likely to develop heart failure and cardiomyopathy than NHW women. In contrast, Hispanic women were at lower risk of any incident CVD, serious CVD, arrhythmia, heart failure or cardiomyopathy, and ischemic heart disease, which was consistent with the associations found with Native American ancestry. Conclusions: This is the largest multi-ethnic study of disparities in CVD health in breast cancer survivors, demonstrating corroborating findings between self-reported race and ethnicity and genetic similarity. The results highlight disparities in cardiometabolic risk factors and CVD among breast cancer survivors that warrant more research and clinical attention in these distinct, high-risk populations.

10

A Cross-Cohort Validated Plasma Lipid Biomarker Assay for Early Breast Cancer Detection Using Machine Learning

Huang, T.; Koch, F. C.; Peake, D. A.; Adam, K.-P.; David, M.; Li, D.; Heffernan, K.; Lim, A.; Hurrell, J. G.; Preston, S.; Baterseh, A.; Vafaee, F.

2026-04-23 oncology 10.64898/2026.04.23.26351564 medRxiv

Top 0.3%

1.7%

Early detection of breast cancer remains essential for improving clinical outcomes, and complementary non-invasive approaches are needed to support existing screening methods, particularly for women with dense breast tissue. We have previously reported plasma lipid biomarker discovery using untargeted high-resolution liquid chromatography tandem mass spectrometry (LC-MS/MS). In this study, we performed biomarker confirmation and developed machine-learning models applied to targeted plasma lipid measurements for the non-invasive detection of early-stage breast cancer across international cohorts with independent external validation. Targeted LC-MS/MS was used to quantify candidate lipid panels in plasma samples from European discovery cohorts (n = 554) and an independent Australian cohort (n = 266) used for external validation. Data-driven feature selection identified a 15-lipid panel with strong performance in European cohorts (AUC >= 0.94). External validation prior to confidence stratification yielded 76% sensitivity, 64% specificity, and an AUC of 0.81 in the Australian validation cohort. Clinical assay development requires iterative panel and model testing to support translational feasibility and performance in the intended-use population. An analytically viable panel, excluding lipids requiring complex and costly synthesis, achieved comparable accuracy with improved assay robustness. Confidence-based analysis showed enhanced performance for predictions made with moderate to high confidence, with sensitivity up to 89% and AUC up to 0.85, suggesting that ongoing research should focus on strategies to enhance diagnostic model confidence. Importantly, model predictions were independent of breast density, tumour size, grade, subtype, and morphology, indicating biological specificity of the lipid signature. These results demonstrate that calibrated machine-learning models applied to plasma lipid biomarkers can support non-invasive breast cancer detection. 
Expanding training datasets to include greater diversity will further improve performance in the ongoing development of this lipid-based detection approach.

11

Histology-Derived Signatures Predict Recurrence Risk and Chemotherapy Benefit in Randomized Trials of Early Breast Cancer

Howard, F. M.; Li, A.; Kochanny, S.; Sullivan, M.; Flores, E. M.; Dolezal, J.; Khramtsova, G.; Hassan, S.; Medenwald, R.; Saha, P.; Fan, C.; McCart, L.; Watson, M.; Teras, L. R.; Bodelon, C.; Patel, A. V.; Symmans, W. F.; Partridge, A.; Carey, L.; Olopade, O. I.; Stover, D.; Perou, C.; Yao, K.; Pearson, A. T.; Huo, D.

2026-04-24 oncology 10.64898/2026.04.23.26351499 medRxiv

Top 0.4%

1.3%

Purpose: To test whether histology-derived gene-expression signatures from routine hematoxylin and eosin slides are prognostic for recurrence and predictive of chemotherapy benefit in early breast cancer. Methods: We conducted a multi-cohort study including CALGB 9344 (anthracycline +/- paclitaxel), CALGB 9741 (standard vs dose-dense chemotherapy), a pooled Chicago real-world cohort, and the American Cancer Society (ACS) Cancer Prevention Studies-II and -3. Whole-slide images were processed with a previously described pipeline to generate 61 histology-derived signatures per patient. The primary endpoint was distant recurrence-free interval (DRFI), except in ACS, where breast cancer-specific survival was used. Secondary endpoints include distant recurrence-free survival (DRFS) and overall survival. The most prognostic signature in CALGB 9344, selected by Harrell's C-index, was evaluated in additional cohorts. Signature-treatment interaction was assessed by likelihood-ratio tests. Multivariable Cox models incorporating age, tumor size, nodal status, estrogen/progesterone receptor status, and signature were fit in CALGB 9344 to improve risk stratification. Results: A total of 7,170 patients were included across four cohorts. The top histology-derived signature in CALGB 9344 showed strong prognostic performance for 5-year DRFI (C-index 0.63) and performed well across validation cohorts (C-index 0.60, 0.70, and 0.62 in CALGB 9741, Chicago, and ACS, respectively). The strongest predictive signal for treatment benefit was observed for DRFS. High-risk cases identified by the signature demonstrated greater benefit from taxane in CALGB 9344 (adjusted hazard ratio [aHR] 0.76 for DRFS, 95% CI 0.66-0.88; interaction p=0.028), from dose-dense chemotherapy in CALGB 9741 (aHR 0.69, 95% CI 0.56-0.85; interaction p=0.039), and differential chemotherapy benefit in the Chicago cohort (aHR 0.84, 95% CI 0.59-1.21; interaction p=0.009). 
Combined clinical-histology models improved risk stratification and identified low-risk groups with a 2%-10% risk of distant recurrence or breast cancer death. Conclusion: Histology-derived signatures from H&E images are broadly prognostic and, unlike clinical factors, may predict chemotherapy benefit.
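
Because the signature was selected and validated by Harrell's C-index, a minimal sketch of that statistic may help in reading values such as 0.63. This is the textbook pairwise definition for right-censored data, not the authors' pipeline. It simplifies by skipping pairs with tied follow-up times, and the toy inputs in the usage lines are hypothetical.

```python
def harrell_c_index(times, events, risk_scores):
    """Fraction of comparable pairs in which the higher-risk subject fails first.

    times: follow-up time per subject
    events: 1 if the event was observed, 0 if censored
    risk_scores: higher score means higher predicted risk
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is only usable if the earlier subject had an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # tied predictions count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy data: four subjects, all with observed events.
perfect = harrell_c_index([1, 2, 3, 4], [1, 1, 1, 1], [4, 3, 2, 1])
reversed_ranking = harrell_c_index([1, 2, 3, 4], [1, 1, 1, 1], [1, 2, 3, 4])
uninformative = harrell_c_index([1, 2, 3, 4], [1, 1, 1, 1], [1, 1, 1, 1])
print(perfect, reversed_ranking, uninformative)
```

A value of 0.5 corresponds to a ranking no better than chance, so the reported range of 0.60 to 0.70 indicates modest discrimination.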

12

Attention-Guided CNN Ensemble for Binary Classification of High-Grade and Low-Grade Serous Ovarian Carcinoma from Histopathological WSI Patches

Rani, A.; Mishra, S.

2026-04-22 oncology 10.64898/2026.04.21.26351441 medRxiv

Top 0.5%

0.9%

Accurate histopathological differentiation between High-Grade Serous Carcinoma (HGSC) and Low-Grade Serous Carcinoma (LGSC) remains a critical yet challenging aspect of ovarian cancer diagnosis due to their similar morphology and different clinical outcomes. This study presents a deep learning framework that uses custom attention mechanisms, including the Convolutional Block Attention Module (CBAM), Squeeze-and-Excitation (SE) blocks, and a Differential Attention module within five CNN architectures for automated binary classification of ovarian cancer subtypes from H&E WSI patches. Although individual models achieved higher accuracy, the ensemble stacking framework with a shallow MLP meta-learner delivered the best overall performance, with a ROC-AUC of 0.9211, an accuracy of 0.85, and F1-scores of 0.84 and 0.85 across both subtypes. These findings demonstrate that attention-guided feature recalibration combined with ensemble stacking provides robust and clinically interpretable discrimination of ovarian carcinoma subtypes.

13

Semaglutide is associated with improved breast cancer survival, lower metastatic burden, and a dose-survival relationship uncoupled from weight-loss magnitude

Murugadoss, K.; Venkatakrishnan, A. J.; Soundararajan, V.

2026-04-24 oncology 10.64898/2026.04.23.26351609 medRxiv

Top 0.7%

0.6%

Metabolic dysfunction is increasingly recognized as a risk factor for poor outcomes in breast cancer, but whether incretin-based therapies confer survival benefit beyond weight loss remains unresolved. Using a federated electronic health record platform spanning nearly 29 million patients, we evaluated breast cancer survival after semaglutide and tirzepatide initiation in routine care. In 1:1 propensity-matched pooled-comparator analyses, semaglutide was associated with improved overall survival versus metformin, sodium-glucose cotransporter 2 (SGLT2) inhibitor, and dipeptidyl peptidase 4 (DPP4) inhibitor users, with 54 deaths among 2,433 semaglutide users (2.2%) versus 395 deaths among 2,433 comparators (16.2%) over 24 months (log-rank P < 0.001). Tirzepatide showed a favorable survival association relative to pooled anti-diabetic comparators that did not meet statistical significance (P = 0.24), with 3 deaths among 220 users (1.4%) versus 64 deaths among 220 comparators (29.1%). In a head-to-head propensity-score-matched comparison, overall survival did not differ significantly between semaglutide- and tirzepatide-treated patients with pre-existing breast cancer (2,117 per arm; P = 0.12). In semaglutide-treated patients alive and observable at the 1-year landmark, higher maximum dose achieved was significantly associated with lower post-landmark mortality (P = 0.034), with an event rate of approximately 1.0% in the high-dose group (>=1.7 mg) versus approximately 4.5% in the low-dose group (0.25-1.0 mg). Despite a linear relationship between dose and weight loss for semaglutide, however, weight-loss strata did not separate survival outcomes (global P = 0.22). In tirzepatide-treated patients alive and observable at the same landmark, neither maximum dose achieved nor weight-loss strata separated post-landmark survival (P = 0.98 and P = 0.50, respectively).
Structured EHR and AI-based clinical note analyses further showed a significantly lower frequency of documented metastatic disease in semaglutide-treated patients relative to pooled anti-diabetic comparators, including any metastasis (7.0% versus 15.0%, rate ratio 0.5, P < 0.001), bone metastasis (1.0% versus 5.2%, rate ratio 0.2, P < 0.001), and liver, lung, or brain metastases (all P < 0.001). LLM-derived cause-of-death extraction further showed a 60% lower relative proportion of cancer-associated deaths in semaglutide-treated patients (19% of ascertainable deaths) than in matched pooled anti-diabetic comparators (47% of ascertainable deaths), with comparator deaths more often attributed to cancer progression involving metastatic breast cancer, leptomeningeal carcinomatosis, and cancer-driven organ failure. Overall, this study demonstrates that semaglutide use in patients with pre-existing breast cancer is associated with a dose-correlated but weight-loss-independent improvement in overall survival. These findings motivate prospective trials of GLP-1 receptor agonists in breast cancer across various stages and treatment settings.
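
The percentages and ratios quoted in this abstract can be recomputed from the counts it reports. The sketch below is only an arithmetic check on the published figures, not the authors' analysis code; the function names are hypothetical, and the rate ratios are taken as simple quotients of the reported percentages, which is an assumption about how they were derived.

```python
def proportion(events: int, total: int) -> float:
    """Crude proportion of patients with the event."""
    return events / total


def rate_ratio(exposed: float, comparator: float) -> float:
    """Ratio of two proportions (assumed here to be a simple quotient)."""
    return exposed / comparator


def relative_reduction(exposed: float, comparator: float) -> float:
    """Relative reduction of the exposed proportion versus the comparator."""
    return 1.0 - exposed / comparator


# Deaths over 24 months in the matched semaglutide analysis (counts from the abstract).
sema_mortality = proportion(54, 2433)
comp_mortality = proportion(395, 2433)

# Documented metastasis frequencies (percentages from the abstract).
any_met_rr = rate_ratio(7.0, 15.0)
bone_met_rr = rate_ratio(1.0, 5.2)

# Share of ascertainable deaths that were cancer-associated: 19% versus 47%.
cancer_death_reduction = relative_reduction(19.0, 47.0)

print(f"semaglutide mortality: {sema_mortality:.1%}")
print(f"comparator mortality:  {comp_mortality:.1%}")
print(f"any-metastasis rate ratio:  {any_met_rr:.2f}")
print(f"bone-metastasis rate ratio: {bone_met_rr:.2f}")
print(f"relative reduction in cancer-associated deaths: {cancer_death_reduction:.0%}")
```

These are crude, unadjusted quantities; the P values in the abstract come from log-rank and other tests that this check does not reproduce.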

14

Vaginal metabolome signatures of high-risk HPV infection trajectories in HIV-negative premenopausal women

Adebamowo, C.; Adebamowo, S. N. N.; Gbolahan, T.; Ikwueme, O.; Famooto, A.; Owoade, Y.; ACCME Research Group as part of H3Africa Consortium

2026-04-22 epidemiology 10.64898/2026.04.21.26351401 medRxiv

Top 0.8%

0.5%

Persistent detection of high-risk human papillomavirus (HPV) is required for cervical carcinogenesis, yet the metabolic phenotype associated with distinct HPV transition states remains incompletely defined. We analyzed vaginal metabolomics data from 71 HIV-negative, non-smoking, premenopausal women without other sexually transmitted infections, grouped by three-visit HPV trajectories: persistent negative (NNN, n=20), late incident positivity (NNP, n=9), conversion with persistence (NPP, n=13), clearance after prior positivity (PPN, n=16), and persistent positive (PPP, n=13). After detection-based filtering, 186 putative and 64 quantitatively estimated metabolites were retained for integrated univariate, multivariate, network, pathway, and machine learning analyses. Global class separation was weak by PERMANOVA and by five-class classification, indicating that the vaginal metabolome does not reorganize broadly across all HPV states. In contrast, trajectory-specific signals were reproducible. The strongest pairwise contrast was NNP versus PPP (best cross-validated ROC AUC 0.778; permutation p=0.039). Glycolic acid was the dominant single metabolite, particularly for NNP versus PPP (Mann-Whitney p=6.96x10^-4, FDR=0.0446, AUROC=0.902; detection 88.9% versus 15.4%; combined abundance+detection FDR=0.0010). Persistent positivity was characterized by a focused uracil-high, methyl-donor/redox-low signature, including lower glycolic acid, S-adenosylmethionine, NAD+, and betaine, together with higher uracil. Ratio mining further sharpened discrimination, with uracil/S-adenosylmethionine and uracil/creatinine among the best PPP classifiers, and glucose 1-phosphate/isovaleric acid-valeric acid strongly separating NNP from NPP. These data support a model in which HPV trajectory is encoded by targeted metabolic states rather than a diffuse HPV-positive versus HPV-negative metabolomic shift.

15

Integrated Single-Cell and Spatial Profiling of MMP Gene Expression in Colorectal Cancer

Danese, N. A.; Kurkcu, S. R.; Bleiler, M.; Nito, K.; Kuo, A.; Rosenberg, D. W.; Nakanishi, M.; Giardina, C.

2026-04-21 cancer biology 10.64898/2026.04.17.719089 medRxiv

Top 0.8%

0.5%

Increased matrix metalloproteinase (MMP) expression has long been recognized as a common feature of colorectal cancers (CRCs), yet less is known about how these enzymes interact to impact cancer progression. Taking advantage of single-cell and spatial transcriptomic data, we analyzed the cell-type-specific and spatial expression of MMPs in CRCs. Distinct colon cancer-associated fibroblast (CAF) subtypes were found to express different MMP combinations, including MMP1/3-expressing and MMP11-expressing CAFs. Conversely, myeloid cells (monocytes, macrophages, and dendritic cells) expressed varying levels of the "myeloid MMPs" 9, 12, and 14, which correlated closely with secretory gene expression. Finally, a small population of cancer cells expressed high levels of MMP7. The MMP7-expressing cancer cells frequently co-expressed MMP1, MMP14, and several Wnt-related genes, consistent with a cancer cell type at high risk of malignancy and metastasis. Spatial transcriptomic data showed MMP expression in discernible clusters driven in part by cell-type localization, including fibroblast-heavy stromal regions and inflammatory cell hubs. Epithelial-rich areas showed subregions of MMP7-expressing cancer cells, including areas where cancer cell and myeloid MMP expression overlap. Tumors showed a wide variation in MMP1-expressing CAFs, a variation reflected in primary CAF cell lines. In vitro, MMP1 expression was a stable phenotype that persisted through multiple rounds of division. MMP1-expressing CAFs were frequently positioned at the stromal interface, suggesting a role in facilitating cell movement across the tumor boundary. Our analysis indicates that cell-type and positional MMP expression varies between tumors and may play a role in determining lesion progression and cancer spread.

16

Metabolomic Profiling of Dried Blood Spots for Breast Cancer Detection: A Multi-Classifier Validation Study in 2,734 Participants

Anctil, N.; Hauguel, P.; Noel, L.-P.

2026-04-27 oncology 10.64898/2026.04.24.26351695 medRxiv

Top 0.9%

0.4%

Background. Breast cancer (BC) remains the most commonly diagnosed malignancy and the leading cause of cancer-related mortality in women worldwide. Although blood-based untargeted metabolomics has emerged as a promising modality for detecting early-stage BC, the clinical translation of this approach has been bottlenecked by two unresolved issues: (i) the field has almost exclusively relied on serum or plasma, which require venipuncture and cold-chain logistics, and (ii) machine-learning models reported on such data are frequently validated with protocols that are blind to analytical batch structure, producing optimistically biased performance estimates. Methods. We present a breast cancer detection study based on dried blood spots (DBS), an analytical matrix that enables self-collection and ambient-temperature shipping. A cohort of 2,734 participants (114 biopsy-confirmed BC cases; 2,620 non-cancer controls) was profiled by untargeted LC-MS/MS on a Thermo Scientific Orbitrap IQ-X coupled to a Vanquish UHPLC. A 39-metabolite panel meeting MSI Level 1 identification criteria was pre-specified from the published breast-cancer metabolomics literature, frozen prior to LC-MS acquisition, and applied to the present cohort without any feature selection on the data. Six standard supervised-learning architectures (LASSO, Elastic Net, Linear SVM, PLS-DA, OPLS-DA, XGBoost) were evaluated on this pre-specified panel; OPLS-DA is reported only in the sex-matched subgroup analysis where a single-seed 5-fold stratified protocol permits a directly comparable fit. Per-batch control-median normalization is applied upstream; kNN imputation, log transform, and robust scaling are fit within each training fold. 
The evaluation battery comprises batch-aware StratifiedGroupKFold CV at single-seed (seed=42) with inter-seed SD quantified across 10 independent seeds, batch-aware nested CV, a 100-seed held-out 20%-batch validation with disjoint-batch isotonic probability calibration (30% calibration partition), PPV/NPV reporting at multiple operating points and three deployment prevalences, subgroup analyses by TNM stage and tumor grade, pathway-ablation sensitivity analysis, and a 1,000-iteration permutation test. Results. Under batch-aware evaluation (StratifiedGroupKFold, single-seed=42), AUC ranged from 0.914 to 0.949 across classifiers, with LASSO achieving 0.928 and XGBoost 0.949; inter-seed SD across 10 seeds was 0.002-0.006. At 95% specificity, LASSO reached 75.4% sensitivity and XGBoost 81.6%. Held-out batch validation (100 seeds) yielded mean AUC 0.912 for Elastic Net and 0.935 for XGBoost, confirming robust generalization. All 39 panel features showed high coefficient stability, and permutation testing on representative classifiers (LASSO, Linear SVM, PLS-DA) yielded p <= 0.001. Subgroup analyses showed weaker detection of stage IIA tumors (AUC 0.87, n=40) compared with stage IIB/IIIA (AUC 0.95), consistent with stronger metabolic signatures in more advanced disease. Bootstrap coefficient consistency of the Elastic Net classifier confirmed that all 39 panel features received a non-zero multivariate weight in >=80% of 100 stratified bootstraps. Conclusions. On this cohort of diagnosed, pre-treatment breast-cancer cases, DBS LC-MS metabolomic profiling delivers classification performance (AUC 0.928 for LASSO and 0.949 for XGBoost under batch-aware GroupKFold CV at single-seed=42; held-out AUC 0.912-0.935) that is robust across classifier families and biological pathways. The DBS matrix is non-radiating, self-collectable by finger-prick, and mailable at ambient temperature. 
Performance is weaker on stage IIA than on more advanced disease, and prospective validation in an independent asymptomatic screening cohort is required before clinical positioning as a decentralized triage modality.
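
The key idea behind the batch-aware evaluation described above is that all samples from one analytical batch fall on the same side of every train/test split. The sketch below illustrates only that grouping constraint in plain Python; it is a simplified greedy stand-in, not scikit-learn's `StratifiedGroupKFold` and not the authors' pipeline, and it does not stratify by case/control label. The function name and the toy batch labels are hypothetical.

```python
from collections import defaultdict


def grouped_folds(batches, n_folds):
    """Build folds in which every batch is held out exactly once and never split.

    Batches are assigned greedily, largest first, to the currently smallest fold.
    """
    sizes = defaultdict(int)
    for batch in batches:
        sizes[batch] += 1

    fold_of_batch = {}
    fold_sizes = [0] * n_folds
    for batch, size in sorted(sizes.items(), key=lambda kv: (-kv[1], str(kv[0]))):
        target = fold_sizes.index(min(fold_sizes))
        fold_of_batch[batch] = target
        fold_sizes[target] += size

    folds = []
    for k in range(n_folds):
        test_idx = [i for i, b in enumerate(batches) if fold_of_batch[b] == k]
        train_idx = [i for i, b in enumerate(batches) if fold_of_batch[b] != k]
        folds.append((train_idx, test_idx))
    return folds


# Toy example: 12 samples acquired in four batches of unequal size.
batches = ["b1"] * 4 + ["b2"] * 3 + ["b3"] * 3 + ["b4"] * 2
folds = grouped_folds(batches, n_folds=2)

for train_idx, test_idx in folds:
    train_batches = {batches[i] for i in train_idx}
    test_batches = {batches[i] for i in test_idx}
    print(sorted(train_batches), "->", sorted(test_batches))
```

Preprocessing steps that learn parameters (imputation, scaling) would be fit on `train_idx` only and then applied to `test_idx`, which matches the abstract's statement that these steps are fit within each training fold.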

17

Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks

Shim, K. B.

2026-04-24 oncology 10.64898/2026.04.16.26351055 medRxiv

Top 1%

0.3%

Pancreatic ductal adenocarcinoma (PDAC) remains one of the deadliest solid tumors and continues to face low treatment-trial participation, fragmented evidence workflows, and labor-intensive abstraction of unstructured clinical text. Existing oncology-focused language models show promise, but many depend on private institutional corpora, limiting reproducibility and practical reuse across centers. We present Onca, an open 9B dense model designed for four PDAC-relevant tasks: trial eligibility screening, case-specific clinical reasoning, structured pathology report extraction, and molecular variant evidence reasoning. Onca is fine-tuned from Qwopus3.5-9B-v3 with a single Unsloth BF16 LoRA adapter on 37,364 training rows drawn from openly available sources. The evaluation spans 11 panels and compares Onca against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unmodified Qwopus base. Onca achieves the strongest overall results on Trial Screening (81.6 F1), Clinical Reasoning (14.1 composite), Pathology Extraction (30.5 field exact-match), PubMedQA Cancer (68.3 macro-F1), and PubMedQA (66.5 macro-F1). The strongest gains appear in tasks closest to routine oncology workflow, especially trial review and pathology structuring. These findings suggest that clinically targeted pancreatic-cancer language models can be built from open data with competitive performance while remaining practical to train on a single workstation-scale GPU setup.

18

Determinants of DNA-sequence-based Diagnostic Yield in the CSER Consortium

Mavura, Y.; Crosslin, D.; Ferar, K. D.; Lawlor, J. M.; Greally, J. M.; Hindorff, L.; Jarvik, G. P.; Kalla, S.; Koenig, B. A.; Kvale, M.; Kwok, P.-Y.; Norton, M.; Plon, S. E.; Powell, B. C.; Slavotinek, A.; Thompson, M. L.; Popejoy, A. B.; Kenny, E. E.; Risch, N.

2026-04-22 genetic and genomic medicine 10.64898/2026.04.20.26351140 medRxiv

Top 1%

0.3%

Purpose: Diagnostic yield from exome and genome sequencing varies widely across studies. It remains unclear how much of this variation reflects patient-level factors (e.g., sex, clinical features, race/ethnicity, genetic ancestry) versus site-level practices such as sequencing modality or variant interpretation workflows. We aimed to quantify the contributions of these factors to diagnostic outcomes across five U.S. clinical sequencing sites. Methods: We performed a cross-sectional analysis of 3,008 prenatal, neonatal, and pediatric cases from the NHGRI Clinical Sequencing Evidence-Generating Research (CSER) consortium (2017-2023). Clinical indications spanned neurodevelopmental, neurological, immunological, metabolic, craniofacial, skeletal, cardiac, prenatal, and oncologic presentations. Genetic ancestry was inferred from sequencing data, and variants were interpreted using ACMG/AMP guidelines to classify DNA-based diagnoses. Generalized linear mixed models were used to estimate associations between diagnostic yield and fixed effects (sex, prenatal status, isolated cancer, number of clinical indications, sequencing modality, race/ethnicity, and genetic ancestry), while modeling study site as a random effect to quantify between-site variation. Results: The overall diagnostic yield was 19.0%. Multiple clinical indications (OR=1.47, 95% CI 1.20-1.80, p<0.001) were associated with higher diagnostic yield, and male sex (OR=0.80, 95% CI 0.66-0.96, p=0.017) and prenatal status (OR=0.63, 95% CI 0.44-0.90, p=0.012) were associated with lower yield. Sequencing modality, race/ethnicity, genetic ancestry, and isolated cancer were not statistically significantly associated with diagnostic outcomes. A model without fixed effects attributed ~10% of variance in diagnostic yield to between-site differences. After adjusting for covariates, site-level variance decreased to 5.7%, indicating consistent variation across sites not explained by measured patient factors. 
Conclusion: Across five sites, patient-level clinical features influenced diagnostic yield, but substantial site-level variation remained even after adjustment. Differences in variant interpretation or case-classification practices may contribute to this residual variability. Further efforts to increase consistency in exome- and genome-sequencing diagnostic workflows may help reduce inter-site differences.

19

CT-Based Deep Foundation Model for Predicting Immune Checkpoint Inhibitor-Induced Pneumonitis Risk in Lung Cancer

Muneer, A.; Showkatian, E.; Kitsel, Y.; Saad, M. B.; Sujit, S. J.; Soto, F.; Shroff, G. S.; Faiz, S. A.; Ghanbar, M. I.; Ismail, S. M.; Vokes, N. I.; Cascone, T.; Le, X.; Zhang, J.; Byers, L. A.; Jaffray, D.; Chang, J. Y.; Liao, Z.; Naing, A.; Gibbons, D. L.; Vaporciyan, A. A.; Heymach, J. V.; Suresh, K. S.; Altan, M.; Sheshadri, A.; Wu, J.

2026-04-23 oncology 10.64898/2026.04.21.26351428 medRxiv

Top 1%

0.2%

Background: Immune checkpoint inhibitors (ICIs) have revolutionized cancer therapy but can cause serious immune-related adverse events (irAEs), with pneumonitis (ICI-P) being among the most severe. Early identification of high-risk patients before ICI initiation is critical for closer monitoring, timely intervention, and improved outcomes. Purpose: To develop and validate a deep learning foundation model to predict ICI-P from baseline CT scans in patients with lung cancer. Methods: We designed the Checkpoint-Inhibitor Pneumonitis Hazard EstimatoR (CIPHER), a deep learning foundation model that combines contrastive learning with a transformer-based masked autoencoder to predict ICI-P from baseline CT scans in patients with lung cancer. Using self-supervised learning, CIPHER was pre-trained on 590,284 CT slices from 2,500 non-small cell lung cancer (NSCLC) patients to capture heterogeneous lung parenchymal patterns. After pre-training, the model was fine-tuned on an internal NSCLC cohort for ICI-P risk prediction, using images from 254 patients for model development and 93 patients for internal validation. We compared CIPHER with classical radiomic models and further evaluated it on an external NSCLC cohort of 116 patients. Results: In the internal immunotherapy cohort, CIPHER consistently distinguished patients at elevated risk of ICI-P from those without the event, with AUCs ranging from 0.77 to 0.85. In head-to-head benchmarking, CIPHER achieved an AUC of 0.83, outperforming the radiomic models. In the external validation cohort, CIPHER maintained strong performance (AUC = 0.83; balanced accuracy = 81.7%), exceeding the radiomic models (DeLong p = 0.0318) and demonstrating higher specificity without sacrificing sensitivity. By contrast, the radiomic model showed high sensitivity (85.0%) but markedly lower specificity (45.8%). 
Confusion matrix analysis confirmed the robust classification performance of CIPHER, correctly identifying 80 of 96 non-ICI-P cases and 16 of 20 ICI-P cases. Conclusions: We developed and externally validated CIPHER for predicting future risk of ICI-P from pre-treatment CT scans. With prospective validation, CIPHER may be incorporated into routine patient management to improve outcomes.
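
The external-validation figures quoted above are mutually consistent, which can be seen by deriving sensitivity, specificity, and balanced accuracy from the reported confusion-matrix counts. The following is a minimal sketch of that arithmetic using the abstract's numbers; the variable and function names are hypothetical, and the radiomic comparison uses the reported percentages directly because the abstract gives no counts for that model.

```python
def balanced_accuracy(sensitivity: float, specificity: float) -> float:
    """Mean of sensitivity and specificity."""
    return (sensitivity + specificity) / 2.0


# CIPHER, external cohort of 116 patients (counts from the abstract).
true_negatives, total_negatives = 80, 96   # non-ICI-P cases correctly identified
true_positives, total_positives = 16, 20   # ICI-P cases correctly identified

cipher_sensitivity = true_positives / total_positives
cipher_specificity = true_negatives / total_negatives
cipher_balanced = balanced_accuracy(cipher_sensitivity, cipher_specificity)

# Radiomic model: sensitivity and specificity as reported (no counts given).
radiomic_balanced = balanced_accuracy(0.850, 0.458)

print(f"CIPHER sensitivity:        {cipher_sensitivity:.1%}")
print(f"CIPHER specificity:        {cipher_specificity:.1%}")
print(f"CIPHER balanced accuracy:  {cipher_balanced:.1%}")
print(f"Radiomic balanced accuracy: {radiomic_balanced:.1%}")
```

The derived CIPHER balanced accuracy matches the 81.7% stated in the abstract. With only 20 ICI-P cases in the external cohort, the sensitivity estimate rests on a small number of events.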

20

Chinese Herbal Medicine as a complementary therapy for the management of Colorectal Cancer: Study protocol for a Delphi Expert Consensus survey

Ng, C. Y.; Liu, M.; Ai, D.; Yao, L.; Yang, M.; Zhong, L. L.

2026-04-22 oncology 10.64898/2026.04.21.26350990 medRxiv

Top 1%

0.2%

Introduction: Colorectal cancer (CRC) remains a leading cause of cancer-related morbidity and mortality worldwide, despite advances in conventional oncological therapies. In recent years, various studies have made advances in integrative oncology, such as investigating the use of Chinese Herbal Medicine (CHM) as a complementary therapy alongside conventional oncological therapies to alleviate treatment-related adverse effects, improve quality of life, and potentially enhance therapeutic outcomes. Despite this, clinical practice in this area remains highly heterogeneous, with limited standardized guidelines on key areas of concern such as (1) optimal intervention, (2) recommended stage and duration of intervention, (3) safety considerations, and (4) possible herb-drug interactions. Hence, this study aims to establish expert consensus on the usage of CHM as a complementary therapy in the management of CRC, to support safe, consistent, and evidence-informed clinical practice. Methods and Analysis: We will employ a modified Delphi technique to achieve consensus amongst a panel of international experts in various fields related to integrative oncology. Prior to the study, a list of questionnaire items was developed based on a systematic review of existing clinical practice guidelines on CRC. An international panel will be invited based on established international profile in integrative oncology research and clinical practice, and by peer referral. Two rounds of Delphi will be conducted using anonymous online questionnaires. Consensus will be considered reached if at least 50% of the panel strongly agree/disagree that an item should be included or excluded, while strong consensus will be set at 76%. Items that achieve strong consensus after Round 1 will be removed, and the remaining items will be sent out for Round 2 with a summary of Round 1 responses for a final consensus. 
Ethics and Dissemination: Ethics approval has been obtained from the Institutional Review Board of Nanyang Technological University (IRB-2025-1222). Our findings will be disseminated through peer-reviewed publications and conference presentations. Strengths and limitations of this study:
- This study will develop an expert consensus which aims to guide future integration of Chinese Herbal Medicine (CHM) as a complementary therapy into colorectal cancer (CRC) management.
- Key concerns that limit implementation in clinical practice have been identified, namely determining the (1) optimal intervention, (2) recommended stage and duration of intervention, (3) safety considerations, and (4) possible herb-drug interactions; addressing them lays the groundwork for potential future incorporation of CHM into CRC treatment protocols alongside conventional oncology approaches.
- Designing a study e-guide, followed by conducting the consensus rounds online, will facilitate participants' responses and the dissemination of information from previous rounds.
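
The consensus rule in this protocol (at least 50% strong agreement or disagreement for consensus, 76% for strong consensus, with strong-consensus items dropped before Round 2) can be expressed as a small classification step. The sketch below is illustrative only: the function names, the panel size of 25, and the vote counts are hypothetical, and treating 76% as an inclusive "at least" threshold is an assumption, since the abstract does not state whether the bound is inclusive.

```python
def classify_item(strong_votes: int, panel_size: int,
                  consensus: float = 0.50, strong: float = 0.76) -> str:
    """Classify one questionnaire item by the share of panelists who
    strongly agree (or strongly disagree) with it."""
    share = strong_votes / panel_size
    if share >= strong:       # assumed inclusive threshold
        return "strong consensus"
    if share >= consensus:
        return "consensus"
    return "no consensus"


def items_for_round_two(round_one: dict, panel_size: int) -> list:
    """Keep every item that did not reach strong consensus in Round 1."""
    return [item for item, votes in round_one.items()
            if classify_item(votes, panel_size) != "strong consensus"]


# Hypothetical Round 1 tallies for a hypothetical panel of 25 experts.
round_one = {"item A": 20, "item B": 13, "item C": 10}
labels = {item: classify_item(votes, 25) for item, votes in round_one.items()}
carried_forward = items_for_round_two(round_one, 25)

print(labels)
print(carried_forward)
```

In this toy tally, item A would be settled after Round 1, while items B and C would return in Round 2 together with the summary of Round 1 responses.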